Business Intelligence Report: Revenue, Retention & Optimisation

NovaMart Global: Business Intelligence Report

Prepared by: Freda Erinmwingbovo
Prepared for: Richard Okafor, CEO, NovaMart Global
Date: February 23, 2026
Dataset: 541,909 transactions · 38 countries · 2010–2011

Business Objective

NovaMart Global is a UK-based online retailer operating across 38 countries. Despite holding more than a year of rich transaction data, business decisions around pricing, stock management, and customer retention have historically been driven by intuition rather than evidence.

This report delivers a comprehensive data intelligence analysis across six business themes, transforming raw transaction records into actionable revenue strategies.

The Six Business Questions

#   Theme                      Business Question
1   Customer Segmentation      Who are our most valuable customers?
2   Churn Prediction           Who are we losing and why?
3   Upsell & Cross-Sell        What should we sell together?
4   Demand Forecasting         How much will we sell and when?
5   Inventory Replenishment    Are we holding the right stock?
6   Price Optimisation         Are we pricing correctly?

Data Preprocessing

Before any analysis begins, the dataset must be thoroughly inspected, cleaned, and validated. Raw transaction data from retail systems typically contains cancellations, missing records, data entry errors, and structural inconsistencies that must be resolved before any reliable intelligence can be extracted.

This section documents every preprocessing decision made, transparently and with full justification, so the analytical foundation is beyond question.

Library Imports

# ── LIBRARY IMPORTS ──
# Core data manipulation
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from matplotlib.gridspec import GridSpec

# Date handling
from datetime import datetime, timedelta

# Machine learning
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, roc_auc_score, roc_curve, 
                            mean_absolute_error, mean_squared_error)
from sklearn.cluster import KMeans

# Association rules (cross-sell)
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Time series forecasting
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose


# Warnings
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully.")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
All libraries imported successfully.
pandas version: 2.3.3
numpy version: 2.4.2

Data Loading

df = pd.read_excel('Online_Retail.xlsx', engine='openpyxl')

print("Dataset loaded successfully.")
print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nFirst 5 rows:")
df.head()
Dataset loaded successfully.

Shape: 541,909 rows × 8 columns

Columns: ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']

Data Types:
InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

First 5 rows:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

Missing Values

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

missing_summary = pd.DataFrame({
    'Missing Count': missing,
    'Missing (%)': missing_pct
}).sort_values('Missing Count', ascending=False)

print("Missing Value Summary:")
display(missing_summary[missing_summary['Missing Count'] > 0])
Missing Value Summary:
Missing Count Missing (%)
CustomerID 135080 24.93
Description 1454 0.27

Data Cleaning

Missing CustomerID affects 24.93% of records. Dropping these outright would be analytically reckless — these records still contain valid product, quantity, price, and date information.

The approved approach is to split the dataset into two subsets:

  • df_customers: records with CustomerID, used for all customer-level analysis
  • df_transactions: full clean dataset regardless of CustomerID, used for all product and transaction-level analysis

No data is discarded unnecessarily. Every record contributes to the analysis it is capable of supporting.

original_size = len(df)
print(f"Original dataset: {original_size:,} rows\n")

# Step 1: Remove cancelled transactions (InvoiceNo starting with C)
cancelled_mask = df['InvoiceNo'].astype(str).str.startswith('C')
df = df[~cancelled_mask]
print(f"After removing cancellations:        {len(df):,} rows "
      f"(removed {cancelled_mask.sum():,})")

# Step 2: Remove negative or zero quantities
df = df[df['Quantity'] > 0]
print(f"After removing invalid quantities:   {len(df):,} rows")

# Step 3: Remove zero or negative unit prices
df = df[df['UnitPrice'] > 0]
print(f"After removing invalid prices:       {len(df):,} rows")

# Step 4: Remove missing descriptions
df = df.dropna(subset=['Description'])
print(f"After removing missing descriptions: {len(df):,} rows")

# Step 5: Remove duplicates
dupes = df.duplicated().sum()
df = df.drop_duplicates()
print(f"After removing {dupes:,} duplicates:      {len(df):,} rows")

# Step 6: Fix data types
df['InvoiceNo']   = df['InvoiceNo'].astype(str)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['StockCode']   = df['StockCode'].astype(str)
df['Description'] = df['Description'].str.strip().str.upper()

# Step 7: Add derived columns
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['YearMonth']  = df['InvoiceDate'].dt.strftime('%Y-%m')
df['DayOfWeek']  = df['InvoiceDate'].dt.day_name()
df['Hour']       = df['InvoiceDate'].dt.hour
df['Month']      = df['InvoiceDate'].dt.to_period('M')

# ── SPLIT INTO TWO SUBSETS ──

# Customer-level analysis — requires CustomerID
df_customers = df[df['CustomerID'].notna()].copy()
df_customers['CustomerID'] = (df_customers['CustomerID']
                               .astype(int).astype(str))

# Transaction-level analysis — all clean records
df_transactions = df.copy()


print(f"DATASET SPLIT SUMMARY")

print(f"Original records:          {original_size:>10,}")
print(f"Clean records (full):      {len(df_transactions):>10,}")
print(f"Records removed:           {original_size - len(df):>10,}")
print(f"")
print(f"df_customers (with ID):    {len(df_customers):>10,}")
print(f"df_transactions (all):     {len(df_transactions):>10,}")
print(f"")
print(f"Unique customers:          "
      f"{df_customers['CustomerID'].nunique():>10,}")
print(f"Unique products:           "
      f"{df_transactions['StockCode'].nunique():>10,}")
print(f"Unique invoices:           "
      f"{df_transactions['InvoiceNo'].nunique():>10,}")
print(f"Countries covered:         "
      f"{df_transactions['Country'].nunique():>10,}")
print(f"Date range:                "
      f"{df_transactions['InvoiceDate'].min().date()} to "
      f"{df_transactions['InvoiceDate'].max().date()}")
print(f"Total revenue:             "
      f"£{df_transactions['TotalPrice'].sum():>10,.2f}")

print(f"\nSubset usage guide:")
print(f"  Customer Segmentation  → df_customers")
print(f"  Churn Prediction       → df_customers")
print(f"  Cross-Sell             → df_transactions")
print(f"  Demand Forecasting     → df_transactions")
print(f"  Inventory              → df_transactions")
print(f"  Price Optimisation     → df_transactions")
Original dataset: 541,909 rows

After removing cancellations:        532,621 rows (removed 9,288)
After removing invalid quantities:   531,285 rows
After removing invalid prices:       530,104 rows
After removing missing descriptions: 530,104 rows
After removing 5,226 duplicates:      524,878 rows
DATASET SPLIT SUMMARY
Original records:             541,909
Clean records (full):         524,878
Records removed:               17,031

df_customers (with ID):       392,692
df_transactions (all):        524,878

Unique customers:               4,338
Unique products:                3,922
Unique invoices:               19,960
Countries covered:                 38
Date range:                2010-12-01 to 2011-12-09
Total revenue:             £10,642,110.80

Subset usage guide:
  Customer Segmentation  → df_customers
  Churn Prediction       → df_customers
  Cross-Sell             → df_transactions
  Demand Forecasting     → df_transactions
  Inventory              → df_transactions
  Price Optimisation     → df_transactions

Preprocessing Interpretation

The cleaning process removed 17,031 records — 3.1% of the original dataset. This is a conservative and defensible removal rate, achieved by treating missing CustomerID as a data characteristic rather than a disqualifying flaw.

Rather than discarding the 24.93% of records without CustomerID, the dataset was split into two purpose-specific subsets:

  • df_customers (392,692 records): used exclusively for customer-level analysis where identity is essential: segmentation and churn prediction
  • df_transactions (524,878 records): used for all product and transaction-level analysis: cross-sell, demand forecasting, inventory, and price optimisation

The clean dataset spans 13 months of trading across 38 countries, covering 4,338 identifiable customers, 3,922 unique products, and £10,642,110.80 in total revenue: a rich and reliable foundation for all six analytical themes that follow.

Analysis

Business Question

Who are NovaMart’s most valuable customers and how should the business treat them differently?

Methodology

RFM Analysis evaluates every customer across three dimensions: Recency, Frequency, and Monetary value. KMeans clustering groups customers into distinct segments based on their combined RFM profile.

Dataset used: df_customers (392,692 records · 4,338 customers)

RFM Feature Engineering

# Reference date — one day after the last transaction
reference_date = df_customers['InvoiceDate'].max() + timedelta(days=1)

rfm = df_customers.groupby('CustomerID').agg(
    Recency   = ('InvoiceDate', 
                 lambda x: (reference_date - x.max()).days),
    Frequency = ('InvoiceNo', 'nunique'),
    Monetary  = ('TotalPrice', 'sum')
).reset_index()

print("RFM Table: First 10 Customers:")
display(rfm.head(10))
print("RFM Summary Statistics:")
display(rfm[['Recency','Frequency','Monetary']].describe().round(2))
RFM Table: First 10 Customers:
CustomerID Recency Frequency Monetary
0 12346 326 1 77183.60
1 12347 2 7 4310.00
2 12348 75 4 1797.24
3 12349 19 1 1757.55
4 12350 310 1 334.40
5 12352 36 8 2506.04
6 12353 204 1 89.00
7 12354 232 1 1079.40
8 12355 214 1 459.40
9 12356 23 3 2811.43
RFM Summary Statistics:
Recency Frequency Monetary
count 4338.00 4338.00 4338.00
mean 92.54 4.27 2048.69
std 100.01 7.70 8985.23
min 1.00 1.00 3.75
25% 18.00 1.00 306.48
50% 51.00 2.00 668.57
75% 142.00 5.00 1660.60
max 374.00 209.00 280206.02
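The summary statistics reveal heavy right skew: median Monetary is £668.57 against a maximum of £280,206.02. This is why the clustering step that follows log-transforms the features first. A minimal sketch of the effect, using synthetic lognormal spend data rather than the real table:

```python
import numpy as np

def skewness(x):
    # Sample skewness: third standardised moment
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

rng = np.random.default_rng(123)
# Synthetic customer spend with a long right tail, mimicking the Monetary column
spend = rng.lognormal(mean=6.5, sigma=1.2, size=4338)

raw_skew = skewness(spend)            # strongly positive: a few very large spenders
log_skew = skewness(np.log1p(spend))  # near zero after the log1p transform

print(f"Skew before log1p: {raw_skew:.2f}")
print(f"Skew after  log1p: {log_skew:.2f}")
```

Without the transform, the handful of extreme spenders would dominate the Euclidean distances that KMeans relies on.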

RFM Scoring and Clustering

#  Log transform Monetary and Frequency to reduce skew
rfm['Recency_log']   = np.log1p(rfm['Recency'])
rfm['Frequency_log'] = np.log1p(rfm['Frequency'])
rfm['Monetary_log']  = np.log1p(rfm['Monetary'])

# Scale features
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(
    rfm[['Recency_log', 'Frequency_log', 'Monetary_log']]
)

# ── FIND OPTIMAL K — ELBOW METHOD ──
inertia = []
k_range = range(2, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=123, n_init=10)
    km.fit(rfm_scaled)
    inertia.append(km.inertia_)

# Plot elbow curve
fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_facecolor('#0d0d1a')
ax.set_facecolor('#13131f')

ax.plot(k_range, inertia, color='#00c896', linewidth=2.5,
        marker='o', markersize=7, markerfacecolor='#f0a500',
        markeredgecolor='#f0a500')

ax.set_xlabel('Number of Clusters (k)', color='#c8c8d8', fontsize=11)
ax.set_ylabel('Inertia', color='#c8c8d8', fontsize=11)
ax.set_title('Figure: Elbow Method: Optimal Number of Clusters',
             color='#ffffff', fontsize=13, fontweight='bold', pad=15)
ax.tick_params(colors='#c8c8d8')
ax.spines[:].set_color('#2a2a3a')
ax.grid(color='#2a2a3a', linestyle='--', linewidth=0.7)

plt.tight_layout()
plt.show()

print("\nInertia values by k:")
for k, i in zip(k_range, inertia):
    print(f"  k={k}: {i:.2f}")


Inertia values by k:
  k=2: 6483.58
  k=3: 4869.68
  k=4: 3938.91
  k=5: 3297.51
  k=6: 2854.95
  k=7: 2548.80
  k=8: 2344.08
  k=9: 2156.08
  k=10: 2000.55
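The curve flattens around k = 4, the value adopted for the final model. The silhouette score is a common complementary check (not part of the original analysis); a minimal sketch on synthetic, well-separated blob data, standing in for the scaled RFM matrix, shows how the score peaks at the true cluster count:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four clearly separated synthetic clusters stand in for the scaled RFM features
centers = [(-6, -6), (-6, 6), (6, -6), (6, 6)]
X_demo, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.8,
                       random_state=123)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=123, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print(f"Best k by silhouette: {best_k}")
```

On real RFM data the peak is rarely this clean, but the same loop over `rfm_scaled` would give a second opinion on the elbow choice.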

Final Clustering

km_final = KMeans(n_clusters=4, random_state=123, n_init=10)
rfm['Cluster'] = km_final.fit_predict(rfm_scaled)

# ── CLUSTER PROFILING ──
cluster_profile = rfm.groupby('Cluster').agg(
    Customers  = ('CustomerID', 'count'),
    Avg_Recency   = ('Recency', 'mean'),
    Avg_Frequency = ('Frequency', 'mean'),
    Avg_Monetary  = ('Monetary', 'mean'),
    Total_Revenue = ('Monetary', 'sum')
).round(2).reset_index()

# ── LABEL CLUSTERS BASED ON PROFILE ──
# Sort by Monetary descending to assign labels
cluster_profile = cluster_profile.sort_values(
    'Avg_Monetary', ascending=False
).reset_index(drop=True)

labels = ['Champions', 'Loyal Customers', 'At Risk', 'Lost']
cluster_profile['Segment'] = labels

# Map labels back to rfm dataframe
label_map = dict(zip(cluster_profile['Cluster'],
                     cluster_profile['Segment']))
rfm['Segment'] = rfm['Cluster'].map(label_map)

print("Cluster Profiles:")
display(cluster_profile[['Segment', 'Customers', 'Avg_Recency',
                          'Avg_Frequency', 'Avg_Monetary',
                          'Total_Revenue']])
Cluster Profiles:
Segment Customers Avg_Recency Avg_Frequency Avg_Monetary Total_Revenue
0 Champions 717 12.21 13.71 8059.18 5778434.66
1 Loyal Customers 1183 71.36 4.05 1782.99 2109272.01
2 At Risk 835 18.01 2.15 546.76 456547.39
3 Lost 1603 182.91 1.32 338.71 542954.83

Segment Visualisation

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Customer Segmentation — RFM Analysis',
             color='#ffffff', fontsize=15, fontweight='bold', y=1.01)

colors = ['#00c896', '#f0a500', '#4d9fff', '#ff6b6b']
segments = cluster_profile['Segment'].tolist()

# ── Plot 1: Customer Count by Segment ──
ax1 = axes[0, 0]
ax1.set_facecolor('#13131f')
bars = ax1.bar(segments, cluster_profile['Customers'],
               color=colors, edgecolor='none', width=0.6)
ax1.set_title('Customers per Segment',
              color='#ffffff', fontweight='bold', pad=10)
ax1.set_ylabel('Number of Customers', color='#c8c8d8')
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.set_facecolor('#13131f')
for bar, val in zip(bars, cluster_profile['Customers']):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
             f'{val:,}', ha='center', color='#ffffff', fontsize=10)

# ── Plot 2: Average Monetary by Segment ──
ax2 = axes[0, 1]
ax2.set_facecolor('#13131f')
bars2 = ax2.bar(segments, cluster_profile['Avg_Monetary'],
                color=colors, edgecolor='none', width=0.6)
ax2.set_title('Average Spend per Segment (£)',
              color='#ffffff', fontweight='bold', pad=10)
ax2.set_ylabel('Average Monetary Value (£)', color='#c8c8d8')
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.set_facecolor('#13131f')
for bar, val in zip(bars2, cluster_profile['Avg_Monetary']):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
             f'£{val:,.0f}', ha='center', color='#ffffff', fontsize=9)

# ── Plot 3: Total Revenue by Segment ──
ax3 = axes[1, 0]
ax3.set_facecolor('#13131f')
revenue_pct = (cluster_profile['Total_Revenue'] /
               cluster_profile['Total_Revenue'].sum() * 100).round(1)
wedges, texts, autotexts = ax3.pie(
    cluster_profile['Total_Revenue'],
    labels=segments,
    colors=colors,
    autopct='%1.1f%%',
    startangle=90,
    pctdistance=0.75
)
for text in texts:
    text.set_color('#c8c8d8')
    text.set_fontsize(10)
for autotext in autotexts:
    autotext.set_color('#ffffff')
    autotext.set_fontsize(9)
    autotext.set_fontweight('bold')
ax3.set_title('Revenue Share by Segment',
              color='#ffffff', fontweight='bold', pad=10)
ax3.set_facecolor('#13131f')

# ── Plot 4: Recency vs Frequency (bubble = monetary) ──
ax4 = axes[1, 1]
ax4.set_facecolor('#13131f')
for i, row in cluster_profile.iterrows():
    seg_data = rfm[rfm['Segment'] == row['Segment']]
    ax4.scatter(seg_data['Recency'], seg_data['Frequency'],
                alpha=0.4, s=20, color=colors[i],
                label=row['Segment'])
ax4.set_title('Recency vs Frequency by Segment',
              color='#ffffff', fontweight='bold', pad=10)
ax4.set_xlabel('Recency (days)', color='#c8c8d8')
ax4.set_ylabel('Frequency (purchases)', color='#c8c8d8')
ax4.tick_params(colors='#c8c8d8')
ax4.spines[:].set_color('#2a2a3a')
ax4.legend(facecolor='#13131f', labelcolor='#c8c8d8',
           fontsize=9, framealpha=0.8)

for ax in axes.flat:
    ax.set_facecolor('#13131f')

plt.tight_layout()
plt.show()

Segmentation Findings and Business Recommendations

RFM clustering identified four distinct customer segments with dramatically different value profiles.

The 80/20 finding is even more extreme than expected: NovaMart’s Champions segment, just 717 customers (16.5% of the customer base), generates £5,778,434, or 65.0% of revenue from identified customers. This concentration is a critical business risk.

Segment           Customers   Avg Recency   Avg Frequency     Avg Spend   Revenue Share
Champions               717       12 days   13.7 purchases       £8,059           65.0%
Loyal Customers       1,183       71 days    4.1 purchases       £1,783           23.7%
At Risk                 835       18 days    2.2 purchases         £547            5.1%
Lost                  1,603      183 days    1.3 purchases         £339            6.1%

Recommendations:

  • Champions: Protect at all costs. Launch a VIP programme with early access, exclusive offers, and dedicated account management.
  • Loyal Customers: Target with frequency-driving campaigns. Loyalty points, volume discounts, and personalised product recommendations to move them into Champions.
  • At Risk: Intervene immediately with a time-limited re-engagement campaign.
  • Lost: Selective winback only. Focus on Lost customers who previously had higher spend history.

Business Question

Which customers are at risk of churning and how accurately can this be predicted before it happens?

Methodology

Churn is defined as a customer who has not made a purchase in the last 90 days. Seven behavioural features derived from transaction history were used — Recency was deliberately excluded as a predictive feature to avoid target leakage.

Dataset used: df_customers (392,692 records · 4,338 customers)
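The leakage risk is easy to demonstrate: because the churn label is defined as Recency > 90 days, any model given Recency as a feature can reconstruct the label exactly and would report misleadingly perfect accuracy. A minimal sketch on synthetic recency values:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(123)
recency = rng.integers(1, 375, size=1000)   # days since last purchase
churned = (recency > 90).astype(int)        # label derived directly from the feature

# A depth-1 tree recovers the 90-day threshold and classifies perfectly
stump = DecisionTreeClassifier(max_depth=1, random_state=123)
stump.fit(recency.reshape(-1, 1), churned)
acc = stump.score(recency.reshape(-1, 1), churned)
print(f"Accuracy with leaked feature: {acc:.0%}")  # prints 100%
```

Hence the models below are restricted to behavioural features that do not encode the label.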

Churn Labelling

# Define churn threshold — 90 days
churn_threshold = 90

# Label customers as churned (1) or active (0)
rfm['Churned'] = (rfm['Recency'] > churn_threshold).astype(int)

# Churn summary
total      = len(rfm)
churned    = rfm['Churned'].sum()
active     = total - churned
churn_rate = churned / total * 100

print("Churn Definition: No purchase in last 90 days")
print(f"Total Customers:    {total:>8,}")
print(f"Active Customers:   {active:>8,}  ({100-churn_rate:.1f}%)")
print(f"Churned Customers:  {churned:>8,}  ({churn_rate:.1f}%)")
Churn Definition: No purchase in last 90 days
Total Customers:       4,338
Active Customers:      2,889  (66.6%)
Churned Customers:     1,449  (33.4%)

Churn Feature Engineering (Transaction-Based)

# Reference date
reference_date = df_customers['InvoiceDate'].max() + timedelta(days=1)

# Build behavioural features per customer
churn_features = df_customers.groupby('CustomerID').agg(
    TotalOrders      = ('InvoiceNo', 'nunique'),
    TotalRevenue     = ('TotalPrice', 'sum'),
    AvgOrderValue    = ('TotalPrice', 'mean'),
    TotalItems       = ('Quantity', 'sum'),
    UniqueProducts   = ('StockCode', 'nunique'),
    MonthsActive     = ('YearMonth', 'nunique'),
    AvgDaysBetween   = ('InvoiceDate', lambda x: 
                        x.sort_values().diff().dt.days.mean()
                        if len(x) > 1 else 0),
    LastPurchase     = ('InvoiceDate', 'max')
).reset_index()

# Calculate recency separately for churn label only
churn_features['Recency'] = (
    reference_date - churn_features['LastPurchase']
).dt.days

# Define churn label
churn_features['Churned'] = (
    churn_features['Recency'] > 90
).astype(int)

# Safety net: ensure AvgDaysBetween has no NaN
# (single-order customers already return 0 from the lambda above)
churn_features['AvgDaysBetween'] = (
    churn_features['AvgDaysBetween'].fillna(0)
)

# Drop columns not used as features
features = [
    'TotalOrders', 'TotalRevenue', 'AvgOrderValue',
    'TotalItems', 'UniqueProducts', 'MonthsActive',
    'AvgDaysBetween'
]

X = churn_features[features]
y = churn_features['Churned']

# Summary
print(f"Churn Feature Matrix: {X.shape[0]:,} customers x "
      f"{X.shape[1]} features")
print(f"\nFeatures used: {features}")
print(f"\nChurn distribution:")
print(f"  Active:  {(y==0).sum():,} ({(y==0).mean()*100:.1f}%)")
print(f"  Churned: {(y==1).sum():,} ({(y==1).mean()*100:.1f}%)")
print(f"\nFeature Summary:")
display(X.describe().round(2)) 
Churn Feature Matrix: 4,338 customers x 7 features

Features used: ['TotalOrders', 'TotalRevenue', 'AvgOrderValue', 'TotalItems', 'UniqueProducts', 'MonthsActive', 'AvgDaysBetween']

Churn distribution:
  Active:  2,889 (66.6%)
  Churned: 1,449 (33.4%)

Feature Summary:
TotalOrders TotalRevenue AvgOrderValue TotalItems UniqueProducts MonthsActive AvgDaysBetween
count 4338.00 4338.00 4338.00 4338.00 4338.00 4338.00 4338.00
mean 4.27 2048.69 68.38 1187.64 61.50 3.01 3.22
std 7.70 8985.23 1467.92 5043.62 85.37 2.63 10.79
min 1.00 3.75 2.14 1.00 1.00 1.00 0.00
25% 1.00 306.48 12.39 159.00 16.00 1.00 0.00
50% 2.00 668.57 17.76 378.00 35.00 2.00 1.07
75% 5.00 1660.60 24.88 989.75 77.00 4.00 2.95
max 209.00 280206.02 77183.60 196915.00 1787.00 13.00 309.00

Model Training

# ── SCALE AND SPLIT ──
scaler_churn = StandardScaler()
X_scaled = scaler_churn.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=123, stratify=y
)

print(f"Training set: {X_train.shape[0]:,} customers")
print(f"Test set:     {X_test.shape[0]:,} customers")

# ── MODEL TRAINING ──
models = {
    'Logistic Regression': LogisticRegression(random_state=123,
                                              max_iter=1000),
    'Random Forest':       RandomForestClassifier(random_state=123,
                                                  n_estimators=100),
    'Decision Tree':       DecisionTreeClassifier(random_state=123)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred  = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    results[name] = {
        'Accuracy':  round(accuracy_score(y_test, y_pred) * 100, 2),
        'ROC-AUC':   round(roc_auc_score(y_test, y_proba) * 100, 2),
        'y_pred':    y_pred,
        'y_proba':   y_proba,
        'model':     model
    }
    print(f"{name} — Trained")

# ── COMPARISON TABLE ──
comparison = pd.DataFrame({
    'Model':    list(results.keys()),
    'Accuracy': [results[m]['Accuracy'] for m in results],
    'ROC-AUC':  [results[m]['ROC-AUC']  for m in results]
}).sort_values('ROC-AUC', ascending=False).reset_index(drop=True)


print("Model Comparison Churn Prediction")


display(comparison)
print(f"\nBest Model: {comparison['Model'].iloc[0]}")
print(f"ROC-AUC:    {comparison['ROC-AUC'].iloc[0]}%")
print(f"Accuracy:   {comparison['Accuracy'].iloc[0]}%")
Training set: 3,470 customers
Test set:     868 customers
Logistic Regression — Trained
Random Forest — Trained
Decision Tree — Trained
Model Comparison Churn Prediction
Model Accuracy ROC-AUC
0 Logistic Regression 71.20 76.39
1 Random Forest 72.12 76.06
2 Decision Tree 63.48 58.49

Best Model: Logistic Regression
ROC-AUC:    76.39%
Accuracy:   71.2%

Confusion Matrix: Logistic Regression

# ── CONFUSION MATRIX LOGISTIC REGRESSION ──
best_name  = 'Logistic Regression'
best_pred  = results[best_name]['y_pred']
best_proba = results[best_name]['y_proba']

cm = confusion_matrix(y_test, best_pred)

print("Confusion Matrix — Logistic Regression")

print(f"True Negatives  (Active correctly identified):   {cm[0,0]}")
print(f"False Positives (Active wrongly flagged churned):{cm[0,1]}")
print(f"False Negatives (Churned missed by model):       {cm[1,0]}")
print(f"True Positives  (Churned correctly identified):  {cm[1,1]}")

# ── ACTUAL VS PREDICTED SUMMARY ──
churn_results = pd.DataFrame({
    'Actual':    y_test.values,
    'Predicted': best_pred
})

churn_summary = churn_results.groupby(
    ['Actual', 'Predicted']
).size().reset_index(name='Count')

churn_summary['Actual']    = churn_summary['Actual'].map(
    {0: 'Active', 1: 'Churned'}
)
churn_summary['Predicted'] = churn_summary['Predicted'].map(
    {0: 'Active', 1: 'Churned'}
)
churn_summary['Result'] = churn_summary.apply(
    lambda r: 'Correct' if r['Actual'] == r['Predicted']
    else 'Incorrect', axis=1
)

print("Actual vs Predicted Summary:")
display(churn_summary)

# ── PREDICTION BREAKDOWN ──
correct   = (churn_results['Actual'] == churn_results['Predicted']).sum()
incorrect = len(churn_results) - correct

print(f"\nPrediction Breakdown:")
print(f"  Correctly classified:   {correct}")
print(f"  Incorrectly classified: {incorrect}")
print(f"  Overall Accuracy:       {round(correct/len(churn_results)*100, 2)}%")
Confusion Matrix — Logistic Regression
True Negatives  (Active correctly identified):   459
False Positives (Active wrongly flagged churned):119
False Negatives (Churned missed by model):       131
True Positives  (Churned correctly identified):  159
Actual vs Predicted Summary:
Actual Predicted Count Result
0 Active Active 459 Correct
1 Active Churned 119 Incorrect
2 Churned Active 131 Incorrect
3 Churned Churned 159 Correct

Prediction Breakdown:
  Correctly classified:   618
  Incorrectly classified: 250
  Overall Accuracy:       71.2%

Save Best Churn Model

# ── SAVE BEST CHURN MODEL ──
import joblib

joblib.dump(results['Logistic Regression']['model'], 
            'best_model_churn.pkl')
joblib.dump(scaler_churn, 
            'scaler_churn.pkl')

print("Churn model saved successfully.")
print("Model:   best_model_churn.pkl")
print("Scaler:  scaler_churn.pkl")
print("\nNote: Scaler saved alongside model to ensure")
print("new customer data is transformed consistently")
print("before scoring.")
Churn model saved successfully.
Model:   best_model_churn.pkl
Scaler:  scaler_churn.pkl

Note: Scaler saved alongside model to ensure
new customer data is transformed consistently
before scoring.

Churn Prediction Findings and Business Recommendations

Three models were trained on seven behavioural features. Recency was excluded to prevent target leakage.

Model                 Accuracy   ROC-AUC
Logistic Regression     71.20%    76.39%
Random Forest           72.12%    76.06%
Decision Tree           63.48%    58.49%

Of 868 test customers, 618 were correctly classified (71.2%). The model correctly identified 159 of 290 churned customers and 459 of 578 active customers.
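The churn-class precision and recall follow directly from the confusion-matrix counts reported above; they are worth computing because accuracy alone hides how many churners the model misses:

```python
# Counts from the Logistic Regression confusion matrix above
tn, fp, fn, tp = 459, 119, 131, 159

precision = tp / (tp + fp)  # share of flagged customers who actually churned
recall    = tp / (tp + fn)  # share of actual churners the model caught
accuracy  = (tp + tn) / (tn + fp + fn + tp)
f1        = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.1%}")  # 57.2%
print(f"Recall:    {recall:.1%}")     # 54.8%
print(f"Accuracy:  {accuracy:.1%}")   # 71.2%
print(f"F1 score:  {f1:.1%}")
```

A recall of roughly 55% means nearly half of churners go unflagged, which is why supplementing the model with rule-based triggers matters.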

Recommendations:

  • Deploy the Logistic Regression model to score all customers monthly and flag those with a churn probability above 0.4.
  • Supplement predictive scoring with rule-based triggers such as 60 days of inactivity or a sudden drop in order frequency.
  • Prioritise retention spend on Champions and Loyal segments identified in the first analysis.
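The monthly scoring step in the first recommendation can be sketched as below. This is a hypothetical illustration: it fits a stand-in LogisticRegression on synthetic features instead of loading the saved best_model_churn.pkl and scaler_churn.pkl, but the scale-then-score-then-threshold logic is the same:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Stand-in for the seven behavioural features and a fitted model
X_train = rng.normal(size=(500, 7))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Monthly scoring: transform new customers with the SAME scaler, then flag
X_new = rng.normal(size=(10, 7))
churn_prob = model.predict_proba(scaler.transform(X_new))[:, 1]

flagged = pd.DataFrame({'churn_prob': churn_prob.round(3)})
flagged['flag_for_retention'] = flagged['churn_prob'] > 0.4
print(flagged)
```

In production the two joblib artifacts replace the inline `scaler` and `model`, and the flagged list feeds the retention campaign.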

Business Question

What products do customers naturally buy together, and how can NovaMart use this to increase average basket size?

Methodology: Association Rules (Apriori Algorithm)

The Apriori algorithm identifies frequent itemsets: groups of products that regularly appear together in the same transaction. From these itemsets, association rules are generated in the form “customers who buy Product A also tend to buy Product B.”

Three key metrics guide rule selection:

  • Support: how often the itemset appears across all transactions. A support of 0.02 means the combination appears in at least 2% of all baskets.
  • Confidence: given that a customer bought Product A, how likely are they to also buy Product B.
  • Lift: how much more likely the combination is compared to random chance. A lift above 1 indicates a genuine association. Higher lift means stronger recommendation.
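A worked example of the three metrics, computed by hand on five toy baskets (hypothetical products A–D, not drawn from the NovaMart data):

```python
# Five toy baskets; A and B co-occur more often than chance alone would suggest
baskets = [
    {'A', 'B'},
    {'A', 'B', 'C'},
    {'A', 'C'},
    {'B', 'D'},
    {'C', 'D'},
]
n = len(baskets)

support_A  = sum('A' in b for b in baskets) / n         # 3/5 = 0.6
support_B  = sum('B' in b for b in baskets) / n         # 3/5 = 0.6
support_AB = sum({'A', 'B'} <= b for b in baskets) / n  # 2/5 = 0.4

confidence = support_AB / support_A                # P(B | A) ≈ 0.667
lift       = support_AB / (support_A * support_B)  # ≈ 1.11 > 1: genuine association

print(f"support(A,B)={support_AB:.2f}  "
      f"confidence(A→B)={confidence:.3f}  lift={lift:.2f}")
```

mlxtend's `apriori` and `association_rules` compute exactly these quantities, just over thousands of real baskets at once.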

Dataset used: df_transactions (524,878 records)

Cross-Sell Basket Preparation

# Build basket matrix — one row per invoice, one column per product
# UK transactions only for cleaner, more representative rules
df_uk = df_transactions[
    df_transactions['Country'] == 'United Kingdom'
].copy()

print(f"UK transactions: {len(df_uk):,} rows")
print(f"UK invoices:     {df_uk['InvoiceNo'].nunique():,}")
print(f"UK products:     {df_uk['StockCode'].nunique():,}")

# Create basket — True/False for each product per invoice
basket = df_uk.groupby(
    ['InvoiceNo', 'Description']
)['Quantity'].sum().unstack(fill_value=0)

# Convert quantities to binary (bought or not)
basket_binary = (basket > 0).astype(int)

print(f"\nBasket matrix shape: {basket_binary.shape}")
print(f"Invoices: {basket_binary.shape[0]:,}")
print(f"Products: {basket_binary.shape[1]:,}")
UK transactions: 479,985 rows
UK invoices:     18,019
UK products:     3,916

Basket matrix shape: (18019, 3996)
Invoices: 18,019
Products: 3,996

Apriori Algorithm

# Generate frequent itemsets
# min_support = 0.02 means product combo appears in at least 2% of baskets
frequent_itemsets = apriori(
    basket_binary,
    min_support     = 0.02,
    use_colnames    = True,
    max_len         = 2
)

frequent_itemsets = frequent_itemsets.sort_values(
    'support', ascending=False
).reset_index(drop=True)

print(f"Frequent itemsets found: {len(frequent_itemsets):,}")
print(f"\nTop 10 most frequent itemsets:")
display(frequent_itemsets.head(10))

# ── GENERATE ASSOCIATION RULES ──
rules = association_rules(
    frequent_itemsets,
    metric          = 'lift',
    min_threshold   = 1.5
)

rules = rules.sort_values('lift', ascending=False).reset_index(drop=True)

# Keep only the most useful columns
rules_clean = rules[[
    'antecedents', 'consequents',
    'support', 'confidence', 'lift'
]].copy()

rules_clean['antecedents'] = rules_clean['antecedents'].apply(
    lambda x: ', '.join(list(x))
)
rules_clean['consequents'] = rules_clean['consequents'].apply(
    lambda x: ', '.join(list(x))
)
rules_clean = rules_clean.round(4)

print(f"\nAssociation rules generated: {len(rules_clean):,}")
print(f"\nTop 15 rules by lift:")
display(rules_clean.head(15))
Frequent itemsets found: 397

Top 10 most frequent itemsets:
support itemsets
0 0.119984 (WHITE HANGING HEART T-LIGHT HOLDER)
1 0.107387 (JUMBO BAG RED RETROSPOT)
2 0.093512 (REGENCY CAKESTAND 3 TIER)
3 0.088407 (PARTY BUNTING)
4 0.077252 (LUNCH BAG RED RETROSPOT)
5 0.076086 (ASSORTED COLOUR BIRD ORNAMENT)
6 0.068872 (SET OF 3 CAKE TINS PANTRY DESIGN)
7 0.067651 (NATURAL SLATE HEART CHALKBOARD)
8 0.067484 (LUNCH BAG BLACK SKULL.)
9 0.064598 (HEART OF WICKER SMALL)

Association rules generated: 194

Top 15 rules by lift:
antecedents consequents support confidence lift
0 WOODEN HEART CHRISTMAS SCANDINAVIAN WOODEN STAR CHRISTMAS SCANDINAVIAN 0.0204 0.7230 27.1973
1 WOODEN STAR CHRISTMAS SCANDINAVIAN WOODEN HEART CHRISTMAS SCANDINAVIAN 0.0204 0.7683 27.1973
2 PINK REGENCY TEACUP AND SAUCER GREEN REGENCY TEACUP AND SAUCER 0.0320 0.8205 15.8635
3 GREEN REGENCY TEACUP AND SAUCER PINK REGENCY TEACUP AND SAUCER 0.0320 0.6180 15.8635
4 DOLLY GIRL LUNCH BOX SPACEBOY LUNCH BOX 0.0236 0.6094 15.6655
5 SPACEBOY LUNCH BOX DOLLY GIRL LUNCH BOX 0.0236 0.6077 15.6655
6 STRAWBERRY CHARLOTTE BAG WOODLAND CHARLOTTE BAG 0.0208 0.5557 14.7042
7 WOODLAND CHARLOTTE BAG STRAWBERRY CHARLOTTE BAG 0.0208 0.5492 14.7042
8 PINK REGENCY TEACUP AND SAUCER ROSES REGENCY TEACUP AND SAUCER 0.0302 0.7764 14.6330
9 ROSES REGENCY TEACUP AND SAUCER PINK REGENCY TEACUP AND SAUCER 0.0302 0.5701 14.6330
10 GARDENERS KNEELING PAD CUP OF TEA GARDENERS KNEELING PAD KEEP CALM 0.0300 0.7213 14.4741
11 GARDENERS KNEELING PAD KEEP CALM GARDENERS KNEELING PAD CUP OF TEA 0.0300 0.6024 14.4741
12 STRAWBERRY CHARLOTTE BAG CHARLOTTE BAG PINK POLKADOT 0.0201 0.5379 14.4015
13 CHARLOTTE BAG PINK POLKADOT STRAWBERRY CHARLOTTE BAG 0.0201 0.5379 14.4015
14 CHARLOTTE BAG PINK POLKADOT RED RETROSPOT CHARLOTTE BAG 0.0266 0.7117 14.2025

Cross-Sell Visualization

Show Code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Cross-Sell Association Rules',
             color='#ffffff', fontsize=14, fontweight='bold')

# ── Plot 1: Support vs Confidence, sized by Lift ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')

scatter = ax1.scatter(
    rules_clean['support'],
    rules_clean['confidence'],
    c     = rules_clean['lift'],
    s     = rules_clean['lift'] * 3,
    cmap  = 'YlGn',
    alpha = 0.7,
    edgecolors = 'none'
)

cbar = plt.colorbar(scatter, ax=ax1)
cbar.set_label('Lift', color='#c8c8d8')
cbar.ax.yaxis.set_tick_params(color='#c8c8d8')
plt.setp(cbar.ax.yaxis.get_ticklabels(), color='#c8c8d8')

ax1.set_xlabel('Support', color='#c8c8d8', fontsize=10)
ax1.set_ylabel('Confidence', color='#c8c8d8', fontsize=10)
ax1.set_title('Support vs Confidence\n(size and colour = lift)',
              color='#ffffff', fontweight='bold', pad=10)
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.grid(color='#2a2a3a', linestyle='--', linewidth=0.6)

# ── Plot 2: Top 10 Rules by Lift ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')

top10 = rules_clean.head(10).copy()
top10['rule'] = (top10['antecedents'].str[:25] + ' →\n' +
                 top10['consequents'].str[:25])

bars = ax2.barh(
    range(len(top10)),
    top10['lift'],
    color     = '#00c896',
    edgecolor = 'none',
    height    = 0.6
)

ax2.set_yticks(range(len(top10)))
ax2.set_yticklabels(top10['rule'], color='#c8c8d8', fontsize=7)
ax2.set_xlabel('Lift', color='#c8c8d8', fontsize=10)
ax2.set_title('Top 10 Cross-Sell Rules by Lift',
              color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--', linewidth=0.6, axis='x')
ax2.invert_yaxis()

for bar, val in zip(bars, top10['lift']):
    ax2.text(bar.get_width() + 0.3, bar.get_y() + bar.get_height()/2,
             f'{val:.1f}', va='center', color='#ffffff', fontsize=8)

plt.tight_layout()
plt.show()

Cross-Sell Findings and Business Recommendations

194 association rules identified from 397 frequent itemsets across 18,019 UK invoices.

If Customer Buys Recommend Confidence Lift
Wooden Heart Christmas Scandinavian Wooden Star Christmas Scandinavian 72.3% 27.2
Pink Regency Teacup and Saucer Green Regency Teacup and Saucer 82.1% 15.9
Dolly Girl Lunch Box Spaceboy Lunch Box 60.9% 15.7
Woodland Charlotte Bag Strawberry Charlotte Bag 54.9% 14.7

NovaMart customers are collectors and gift buyers. The strongest associations reflect themed sets and colour variants within the same range.

Recommendations:

  • Implement a “Frequently Bought Together” section on all product pages powered by these association rules.
  • Bundle top associated pairs into curated gift sets with a slight price incentive.
  • Trigger cross-sell emails within 48 hours of purchase to capture buying intent while it is still active.
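These metrics can be sanity-checked by hand. The sketch below computes support, confidence, and lift on a toy basket matrix; the product names and purchase patterns are illustrative, not drawn from the NovaMart data.

```python
import pandas as pd

# Toy basket matrix (rows = invoices, columns = products).
# Illustrative values only — not from the NovaMart dataset.
baskets = pd.DataFrame({
    'TEACUP_PINK':  [1, 1, 0, 1, 0, 0],
    'TEACUP_GREEN': [1, 1, 0, 1, 0, 1],
})

n = len(baskets)

# support(A -> B) = P(A and B): share of baskets containing both products
support = ((baskets['TEACUP_PINK'] == 1) &
           (baskets['TEACUP_GREEN'] == 1)).sum() / n

# confidence(A -> B) = P(B | A): of baskets with A, the share also holding B
confidence = support / baskets['TEACUP_PINK'].mean()

# lift(A -> B) = P(B | A) / P(B): how much A raises the chance of B
lift = confidence / baskets['TEACUP_GREEN'].mean()

print(support, confidence, lift)  # 0.5 1.0 1.5
```

A lift above 1 means the pairing occurs more often than chance; the 27.2 lift on the Scandinavian Christmas pair is therefore an exceptionally strong signal.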

Business Question

How much will NovaMart sell each month, and when should the business expect demand peaks and troughs?

Methodology: Exponential Smoothing (Holt-Winters)

Demand forecasting uses time series analysis to project future sales based on historical patterns. Holt-Winters Exponential Smoothing was selected because it handles three components simultaneously:

  • Level: the baseline average demand
  • Trend: whether demand is growing or declining over time
  • Seasonality: recurring peaks and troughs at regular intervals

NovaMart operates in retail gifting and home decor, making seasonality a critical factor. The model is trained on 13 months of historical monthly revenue and used to forecast the next 3 months.

Dataset used: df_transactions (524,878 records)
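For readers unfamiliar with the recurrence involved, the sketch below implements Holt's linear (trend-only) smoothing, the simplification actually fitted later in this section; the full Holt-Winters model adds a third, seasonal recurrence. The smoothing parameters and series values are illustrative.

```python
# Sketch of the trend-only (Holt's linear) recurrence underlying the
# statsmodels model fitted below. Alpha and beta are illustrative.
def holt_linear(series, alpha, beta, horizon):
    level = series[0]
    trend = series[1] - series[0]  # simple initialisation
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)         # level update
        trend = beta * (level - prev_level) + (1 - beta) * trend  # trend update
    # h-step-ahead forecast extrapolates the final level and trend
    return [level + (h + 1) * trend for h in range(horizon)]

# Illustrative monthly values with a rising trend
forecast = holt_linear([100, 110, 125, 135, 150], alpha=0.5, beta=0.3, horizon=3)
```

Because the forecast is a straight-line extrapolation of the final level and trend, a model of this form will keep projecting growth even past a seasonal peak, which matters for the December to February forecasts below.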

Demand Forecasting: Monthly Revenue Time Series

Show Code
# Aggregate total revenue by month
monthly_revenue = df_transactions.groupby('YearMonth').agg(
    Revenue  = ('TotalPrice', 'sum'),
    Orders   = ('InvoiceNo', 'nunique'),
    Quantity = ('Quantity', 'sum')
).reset_index().sort_values('YearMonth')

print("Monthly Revenue Summary:")
display(monthly_revenue)
print(f"\nTotal months:    {len(monthly_revenue)}")
print(f"Total revenue:   £{monthly_revenue['Revenue'].sum():,.2f}")
print(f"Peak month:      "
      f"{monthly_revenue.loc[monthly_revenue['Revenue'].idxmax(), 'YearMonth']}")
print(f"Peak revenue:    "
      f"£{monthly_revenue['Revenue'].max():,.2f}")
print(f"Lowest month:    "
      f"{monthly_revenue.loc[monthly_revenue['Revenue'].idxmin(), 'YearMonth']}")
print(f"Lowest revenue:  "
      f"£{monthly_revenue['Revenue'].min():,.2f}")
# Exclude partial December 2011 and build time series
monthly_clean = monthly_revenue[
    monthly_revenue['YearMonth'] != '2011-12'
].copy()

revenue_series = monthly_clean.set_index('YearMonth')['Revenue']

print(f"\nTraining series: {len(revenue_series)} months")
print(f"Period: {revenue_series.index[0]} to {revenue_series.index[-1]}")
Monthly Revenue Summary:
YearMonth Revenue Orders Quantity
0 2010-12 821452.730 1559 358019
1 2011-01 689811.610 1086 387099
2 2011-02 522545.560 1100 282934
3 2011-03 716215.260 1454 376599
4 2011-04 536968.491 1246 307953
5 2011-05 769296.610 1681 395001
6 2011-06 760547.010 1533 388511
7 2011-07 718076.121 1475 399693
8 2011-08 757841.380 1361 421020
9 2011-09 1056435.192 1837 569573
10 2011-10 1151263.730 2040 621029
11 2011-11 1503866.780 2769 751377
12 2011-12 637790.330 819 313612

Total months:    13
Total revenue:   £10,642,110.80
Peak month:      2011-11
Peak revenue:    £1,503,866.78
Lowest month:    2011-02
Lowest revenue:  £522,545.56

Training series: 12 months
Period: 2010-12 to 2011-11

Holt-Winters Forecasting (Trend Only)

Show Code
# With only 12 months of data, full seasonal modelling requires
# a minimum of 24 months (two complete cycles).
# We apply a trend-only Exponential Smoothing model, which is honest
# and appropriate given the data available.
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_absolute_error, mean_squared_error

model = ExponentialSmoothing(
    revenue_series,
    trend       = 'add',
    seasonal    = None,
    initialization_method = 'estimated'
)

fitted_model = model.fit(optimized=True)

# Forecast next 3 months
forecast = fitted_model.forecast(3)
forecast_months = ['2011-12', '2012-01', '2012-02']
forecast.index = forecast_months

print(f"Forecast Next 3 Months:")

for month, value in zip(forecast_months, forecast):
    print(f"  {month}:  £{value:,.2f}")


# Model fit metrics
fitted_values = fitted_model.fittedvalues
mae  = mean_absolute_error(revenue_series, fitted_values)
rmse = np.sqrt(mean_squared_error(revenue_series, fitted_values))
mape = (np.abs((revenue_series - fitted_values) /
               revenue_series).mean() * 100)

print(f"\nModel Fit Metrics:")
print(f"  MAE:  £{mae:,.2f}")
print(f"  RMSE: £{rmse:,.2f}")
print(f"  MAPE: {mape:.2f}%")
Forecast Next 3 Months:
  2011-12:  £1,649,797.07
  2012-01:  £1,886,622.27
  2012-02:  £2,123,447.47

Model Fit Metrics:
  MAE:  £128,762.45
  RMSE: £158,154.22
  MAPE: 15.53%

Demand Forecast Visualization

Show Code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Demand Forecasting — Monthly Revenue',
             color='#ffffff', fontsize=14, fontweight='bold')

all_months  = list(revenue_series.index) + forecast_months
all_actual  = list(revenue_series.values)
all_fitted  = list(fitted_values.values)
all_forecast = list(forecast.values)

# ── Plot 1: Actual vs Fitted + Forecast ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')

ax1.plot(range(len(revenue_series)),
         all_actual,
         color='#00c896', linewidth=2.5,
         marker='o', markersize=5,
         label='Actual Revenue')

ax1.plot(range(len(revenue_series)),
         all_fitted,
         color='#f0a500', linewidth=2,
         linestyle='--', label='Fitted Values')

forecast_x = range(len(revenue_series),
                   len(revenue_series) + 3)
ax1.plot(list(forecast_x),
         all_forecast,
         color='#4d9fff', linewidth=2.5,
         marker='o', markersize=6,
         linestyle='--', label='Forecast')

ax1.axvline(x=len(revenue_series) - 0.5,
            color='#ffffff', linewidth=0.8,
            linestyle=':', alpha=0.5)
ax1.text(len(revenue_series) - 0.4,
         max(all_actual) * 0.95,
         'Forecast', color='#4d9fff',
         fontsize=8, fontfamily='monospace')

ax1.set_xticks(range(len(all_months)))
ax1.set_xticklabels(all_months, rotation=45,
                    ha='right', color='#c8c8d8', fontsize=7)
ax1.set_ylabel('Revenue (£)', color='#c8c8d8')
ax1.set_title('Actual vs Fitted vs Forecast',
              color='#ffffff', fontweight='bold', pad=10)
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.grid(color='#2a2a3a', linestyle='--', linewidth=0.6)
ax1.legend(facecolor='#13131f', labelcolor='#c8c8d8', fontsize=9)
ax1.yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)

# ── Plot 2: Monthly Revenue Bar Chart ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')

bar_colors = ['#00c896'] * len(revenue_series) + ['#4d9fff'] * 3
all_values = all_actual + all_forecast

bars = ax2.bar(range(len(all_months)),
               all_values,
               color=bar_colors,
               edgecolor='none', width=0.7)

ax2.set_xticks(range(len(all_months)))
ax2.set_xticklabels(all_months, rotation=45,
                    ha='right', color='#c8c8d8', fontsize=7)
ax2.set_ylabel('Revenue (£)', color='#c8c8d8')
ax2.set_title('Monthly Revenue with Forecast',
              color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--',
         linewidth=0.6, axis='y')
ax2.yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)

from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#00c896', label='Historical'),
    Patch(facecolor='#4d9fff', label='Forecast')
]
ax2.legend(handles=legend_elements,
           facecolor='#13131f',
           labelcolor='#c8c8d8', fontsize=9)

plt.tight_layout()
plt.show()

Demand Forecasting Findings and Business Recommendations

Month Revenue
2011-02 £522,546 (lowest)
2011-09 £1,056,435 (surge begins)
2011-11 £1,503,867 (peak)

3-Month Forecast:

Month Forecast
2011-12 £1,649,797
2012-01 £1,886,622
2012-02 £2,123,447

Honest Assessment: The December forecast is plausible. January and February projections are likely overstated as the trend-only model cannot detect the post-Christmas demand reversal.

Recommendations:

  • Use December forecast as a planning target, adjusting upward by 10% given the strong November close.
  • Apply a manual seasonal correction for January and February based on the 2011 historical pattern.
  • Collect a second full year of data before investing in a production forecasting system.
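The manual seasonal correction in the second recommendation can be sketched as follows. The revenue figures come from the monthly summary above; the correction factors are a simple heuristic ratio, not a fitted seasonal model.

```python
# Scale the trend-only forecast by each month's 2011 revenue relative to
# the average training month. A heuristic correction, not a fitted model.
monthly_2011 = {'01': 689_811.61, '02': 522_545.56}   # Jan, Feb 2011 revenue

# 12 training months: £10,642,110.80 total minus partial Dec 2011 (£637,790.33)
avg_train_month = 10_004_320.47 / 12

seasonal_index = {m: v / avg_train_month for m, v in monthly_2011.items()}

raw_forecast = {'2012-01': 1_886_622.27, '2012-02': 2_123_447.47}
adjusted = {month: value * seasonal_index[month[-2:]]
            for month, value in raw_forecast.items()}
```

Under these factors the January and February projections fall back below their raw trend-only values, which is consistent with the post-Christmas demand reversal observed in early 2011.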

Business Question

Which products need urgent restocking, which are being over-ordered, and how should NovaMart prioritise its inventory investment?

Methodology

Three analytical lenses: Velocity Analysis, Revenue Concentration, and Demand Consistency. Products are classified into four actionable inventory priority tiers.

Dataset used: df_transactions (524,878 records)

Inventory Analysis

Show Code
# Product-level aggregation
inventory = df_transactions.groupby(
    ['StockCode', 'Description']
).agg(
    TotalQuantity  = ('Quantity', 'sum'),
    TotalRevenue   = ('TotalPrice', 'sum'),
    TotalOrders    = ('InvoiceNo', 'nunique'),
    AvgOrderQty    = ('Quantity', 'mean'),
    MonthsActive   = ('YearMonth', 'nunique'),
    AvgMonthlyQty  = ('Quantity', lambda x:
                      x.sum() / df_transactions['YearMonth'].nunique())
).reset_index()

# Revenue share
inventory['RevenueShare'] = (
    inventory['TotalRevenue'] /
    inventory['TotalRevenue'].sum() * 100
).round(4)

# Demand consistency — months active out of 13 total
inventory['Consistency'] = (
    inventory['MonthsActive'] / 
    df_transactions['YearMonth'].nunique() * 100
).round(2)

# Sort by total revenue
inventory = inventory.sort_values(
    'TotalRevenue', ascending=False
).reset_index(drop=True)

# ── INVENTORY PRIORITY TIERS ──
def assign_tier(row):
    if row['RevenueShare'] >= 0.5 and row['Consistency'] >= 70:
        return 'Critical, Restock Immediately'
    elif row['RevenueShare'] >= 0.2 and row['Consistency'] >= 50:
        return 'High Priority, Monitor Closely'
    elif row['Consistency'] < 30 and row['TotalOrders'] < 10:
        return 'Review, Possible Dead Stock'
    else:
        return 'Standard, Routine Replenishment'

inventory['Tier'] = inventory.apply(assign_tier, axis=1)

# Summary
print("Inventory Tier Summary:")
display(inventory['Tier'].value_counts().to_frame())
print("Top 15 Products by Revenue:")
display(inventory[[
    'Description', 'TotalQuantity', 'TotalRevenue',
    'TotalOrders', 'Consistency', 'Tier'
]].head(15))
Inventory Tier Summary:
count
Tier
Standard, Routine Replenishment 3424
Review, Possible Dead Stock 654
High Priority, Monitor Closely 70
Critical, Restock Immediately 10
Top 15 Products by Revenue:
Description TotalQuantity TotalRevenue TotalOrders Consistency Tier
0 DOTCOM POSTAGE 706 206248.77 706 100.00 Critical, Restock Immediately
1 REGENCY CAKESTAND 3 TIER 13851 174156.54 1988 100.00 Critical, Restock Immediately
2 PAPER CRAFT , LITTLE BIRDIE 80995 168469.60 1 7.69 Review, Possible Dead Stock
3 WHITE HANGING HEART T-LIGHT HOLDER 37580 104284.24 2189 100.00 Critical, Restock Immediately
4 PARTY BUNTING 18283 99445.23 1685 100.00 Critical, Restock Immediately
5 JUMBO BAG RED RETROSPOT 48371 94159.81 2089 100.00 Critical, Restock Immediately
6 MEDIUM CERAMIC TOP STORAGE JAR 78033 81700.92 247 69.23 High Priority, Monitor Closely
7 POSTAGE 3150 78101.88 1126 100.00 Critical, Restock Immediately
8 MANUAL 6984 77750.27 289 100.00 Critical, Restock Immediately
9 RABBIT NIGHT LIGHT 30739 66870.03 994 61.54 High Priority, Monitor Closely
10 PAPER CHAIN KIT 50'S CHRISTMAS 19329 64875.59 1160 100.00 Critical, Restock Immediately
11 ASSORTED COLOUR BIRD ORNAMENT 36362 58927.62 1455 100.00 Critical, Restock Immediately
12 CHILLI LIGHTS 10302 54096.36 661 100.00 Critical, Restock Immediately
13 SPOTTY BUNTING 8320 42513.48 1140 61.54 High Priority, Monitor Closely
14 JUMBO BAG PINK POLKADOT 21448 42401.01 1218 100.00 High Priority, Monitor Closely

Clean Inventory, Remove Non-Product Entries

Show Code
non_products = ['DOTCOM POSTAGE', 'POSTAGE', 'MANUAL',
                'AMAZON FEE', 'BANK CHARGES', 'CRUK COMMISSION']

inventory_clean = inventory[
    ~inventory['Description'].str.upper().isin(non_products)
].reset_index(drop=True)

print("Inventory Tier Summary (clean):")
display(inventory_clean['Tier'].value_counts().to_frame())
print("Top 15 Products by Revenue (clean):")
display(inventory_clean[[
    'Description', 'TotalQuantity', 'TotalRevenue',
    'TotalOrders', 'Consistency', 'Tier'
]].head(15))
Inventory Tier Summary (clean):
count
Tier
Standard, Routine Replenishment 3423
Review, Possible Dead Stock 652
High Priority, Monitor Closely 70
Critical, Restock Immediately 7
Top 15 Products by Revenue (clean):
Description TotalQuantity TotalRevenue TotalOrders Consistency Tier
0 REGENCY CAKESTAND 3 TIER 13851 174156.54 1988 100.00 Critical, Restock Immediately
1 PAPER CRAFT , LITTLE BIRDIE 80995 168469.60 1 7.69 Review, Possible Dead Stock
2 WHITE HANGING HEART T-LIGHT HOLDER 37580 104284.24 2189 100.00 Critical, Restock Immediately
3 PARTY BUNTING 18283 99445.23 1685 100.00 Critical, Restock Immediately
4 JUMBO BAG RED RETROSPOT 48371 94159.81 2089 100.00 Critical, Restock Immediately
5 MEDIUM CERAMIC TOP STORAGE JAR 78033 81700.92 247 69.23 High Priority, Monitor Closely
6 RABBIT NIGHT LIGHT 30739 66870.03 994 61.54 High Priority, Monitor Closely
7 PAPER CHAIN KIT 50'S CHRISTMAS 19329 64875.59 1160 100.00 Critical, Restock Immediately
8 ASSORTED COLOUR BIRD ORNAMENT 36362 58927.62 1455 100.00 Critical, Restock Immediately
9 CHILLI LIGHTS 10302 54096.36 661 100.00 Critical, Restock Immediately
10 SPOTTY BUNTING 8320 42513.48 1140 61.54 High Priority, Monitor Closely
11 JUMBO BAG PINK POLKADOT 21448 42401.01 1218 100.00 High Priority, Monitor Closely
12 BLACK RECORD COVER FRAME 11651 40633.38 375 100.00 High Priority, Monitor Closely
13 PICNIC BASKET WICKER 60 PIECES 61 39619.50 2 7.69 Review, Possible Dead Stock
14 DOORMAT KEEP CALM AND COME IN 5487 38133.64 728 69.23 High Priority, Monitor Closely

Inventory Visualization

Show Code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Inventory Replenishment Analysis',
             color='#ffffff', fontsize=14, fontweight='bold')

# ── Plot 1: Tier Distribution ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')

tier_counts = inventory_clean['Tier'].value_counts()
tier_labels = [t.split(',')[0] for t in tier_counts.index]
tier_colors = ['#ff6b6b', '#f0a500', '#00c896', '#4d9fff']

wedges, texts, autotexts = ax1.pie(
    tier_counts,
    labels     = tier_labels,
    colors     = tier_colors,
    autopct    = '%1.1f%%',
    startangle = 90,
    pctdistance = 0.75
)
for text in texts:
    text.set_color('#c8c8d8')
    text.set_fontsize(9)
for autotext in autotexts:
    autotext.set_color('#ffffff')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(9)

ax1.set_title('Product Distribution by Inventory Tier',
              color='#ffffff', fontweight='bold', pad=10)

# ── Plot 2: Top 10 Critical and High Priority Products ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')

top_products = inventory_clean[
    inventory_clean['Tier'].isin([
        'Critical, Restock Immediately',
        'High Priority, Monitor Closely'
    ])
].head(10).copy()

top_products['ShortName'] = (
    top_products['Description'].str[:30]
)

bar_colors = [
    '#ff6b6b' if 'Critical' in t else '#f0a500'
    for t in top_products['Tier']
]

bars = ax2.barh(
    range(len(top_products)),
    top_products['TotalRevenue'],
    color     = bar_colors,
    edgecolor = 'none',
    height    = 0.6
)

ax2.set_yticks(range(len(top_products)))
ax2.set_yticklabels(top_products['ShortName'],
                    color='#c8c8d8', fontsize=8)
ax2.set_xlabel('Total Revenue (£)', color='#c8c8d8')
ax2.set_title('Top Priority Products by Revenue',
              color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--',
         linewidth=0.6, axis='x')
ax2.invert_yaxis()
ax2.xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)

from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#ff6b6b', label='Critical'),
    Patch(facecolor='#f0a500', label='High Priority')
]
ax2.legend(handles=legend_elements,
           facecolor='#13131f',
           labelcolor='#c8c8d8', fontsize=9)

plt.tight_layout()
plt.show()

Inventory Replenishment Findings and Business Recommendations

Tier Products
Critical — Restock Immediately 7
High Priority — Monitor Closely 70
Standard — Routine Replenishment 3,423
Review — Possible Dead Stock 652

The 7 Critical products sell every single month without exception. A stockout on any of these during Q4 would cause direct, measurable revenue loss.

Recommendations:

  • Establish minimum stock thresholds for all 7 Critical products with automatic reorder triggers.
  • Increase Critical product stock levels by at least 40% from October onwards.
  • Review the 652 dead stock candidates quarterly and consider clearance pricing for products with no orders in the last 6 months.
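A minimum stock threshold of the kind recommended above can be sketched as a classic reorder point. The lead time and safety factor below are illustrative assumptions; NovaMart's actual supplier lead times would replace them.

```python
# Hypothetical reorder-point sketch for a Critical product.
# Lead time and safety factor are illustrative, not supplier data.
def reorder_point(avg_monthly_qty, lead_time_days, safety_factor=0.4):
    daily_demand = avg_monthly_qty / 30.0
    lead_time_demand = daily_demand * lead_time_days  # expected sales before restock arrives
    return lead_time_demand * (1 + safety_factor)     # buffer against demand spikes

# REGENCY CAKESTAND 3 TIER: 13,851 units over 13 months (~1,065/month)
rp = reorder_point(avg_monthly_qty=13_851 / 13, lead_time_days=14)
```

With a two-week lead time this implies triggering a reorder when stock falls below roughly 700 units; the safety factor should be raised further for the Q4 uplift recommended above.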

Business Question

Are NovaMart’s products priced correctly and where are opportunities to adjust pricing to maximise revenue?

Methodology

Three analytical lenses: Price Distribution Analysis, Price vs Demand Relationship, and Revenue per Order Analysis.

Dataset used: df_transactions (524,878 records)

Price Optimisation Analysis

Show Code
# Product-level price and demand summary
price_analysis = df_transactions.groupby(
    ['StockCode', 'Description']
).agg(
    AvgPrice      = ('UnitPrice', 'mean'),
    MinPrice      = ('UnitPrice', 'min'),
    MaxPrice      = ('UnitPrice', 'max'),
    TotalQuantity = ('Quantity', 'sum'),
    TotalOrders   = ('InvoiceNo', 'nunique'),
    TotalRevenue  = ('TotalPrice', 'sum'),
    AvgOrderQty   = ('Quantity', 'mean')
).reset_index()

# Remove non-products
price_analysis = price_analysis[
    ~price_analysis['Description'].str.upper().isin(
        non_products
    )
].reset_index(drop=True)

# Revenue per order
price_analysis['RevenuePerOrder'] = (
    price_analysis['TotalRevenue'] /
    price_analysis['TotalOrders']
).round(2)

# Price variability — products sold at different prices
price_analysis['PriceVariability'] = (
    price_analysis['MaxPrice'] -
    price_analysis['MinPrice']
).round(2)

# Price bands
def price_band(price):
    if price < 1:
        return 'Under £1'
    elif price < 5:
        return '£1 to £5'
    elif price < 10:
        return '£5 to £10'
    elif price < 20:
        return '£10 to £20'
    elif price < 50:
        return '£20 to £50'
    else:
        return '£50 and above'

price_analysis['PriceBand'] = price_analysis['AvgPrice'].apply(
    price_band
)

# Summary by price band
band_summary = price_analysis.groupby('PriceBand').agg(
    Products      = ('StockCode', 'count'),
    TotalRevenue  = ('TotalRevenue', 'sum'),
    AvgOrders     = ('TotalOrders', 'mean'),
    AvgQtySold    = ('TotalQuantity', 'mean')
).round(2).reset_index()

band_order = ['Under £1', '£1 to £5', '£5 to £10',
              '£10 to £20', '£20 to £50', '£50 and above']
band_summary['PriceBand'] = pd.Categorical(
    band_summary['PriceBand'],
    categories=band_order, ordered=True
)
band_summary = band_summary.sort_values('PriceBand')

print("Revenue by Price Band:")
display(band_summary)
print("Top 15 Products by Revenue per Order:")
display(price_analysis.nlargest(15, 'RevenuePerOrder')[[
    'Description', 'AvgPrice', 'TotalOrders',
    'TotalRevenue', 'RevenuePerOrder'
]])
print("Highest Price Variability (possible pricing inconsistency):")
display(price_analysis[
    price_analysis['TotalOrders'] > 50
].nlargest(10, 'PriceVariability')[[
    'Description', 'MinPrice', 'AvgPrice',
    'MaxPrice', 'PriceVariability', 'TotalOrders'
]])
Revenue by Price Band:
PriceBand Products TotalRevenue AvgOrders AvgQtySold
0 Under £1 738 708388.68 112.99 2095.08
1 £1 to £5 2538 6524208.41 134.30 1409.12
4 £5 to £10 628 2068701.48 114.55 574.94
2 £10 to £20 198 823580.88 102.19 381.31
3 £20 to £50 37 57956.27 24.22 56.41
5 £50 and above 13 83245.52 13.38 29.85
Top 15 Products by Revenue per Order:
Description AvgPrice TotalOrders TotalRevenue RevenuePerOrder
2665 PAPER CRAFT , LITTLE BIRDIE 2.080000 1 168469.60 168469.60
1415 PICNIC BASKET WICKER 60 PIECES 649.500000 2 39619.50 19809.75
4136 ADJUST BAD DEBT 11062.060000 1 11062.06 11062.06
2870 TEA TIME TEA TOWELS 2.325000 2 6045.00 3022.50
2081 MISELTOE HEART WREATH CREAM 4.150000 1 996.00 996.00
276 SET/5 RED SPOTTY LID GLASS BOWLS 2.550000 1 734.40 734.40
1200 WEEKEND BAG VINTAGE ROSE PAISLEY 7.650000 1 527.85 527.85
1747 HALL CABINET WITH 3 DRAWERS 56.508000 5 2603.53 520.71
1744 UTILTY CABINET WITH HOOKS 21.771667 6 2610.29 435.05
894 POTTING SHED CANDLE CITRONELLA 3.850000 3 1268.19 422.73
2120 MEDIUM CERAMIC TOP STORAGE JAR 1.468480 247 81700.92 330.77
806 VINTAGE POST OFFICE CABINET 66.360000 2 607.65 303.83
180 LUNCH BAG RED SPOTTY 1.450000 1 290.00 290.00
1742 REGENCY MIRROR WITH SHUTTERS 156.428571 7 1530.00 218.57
1567 VINTAGE RED KITCHEN CABINET 143.421053 38 8125.00 213.82
Highest Price Variability (possible pricing inconsistency):
Description MinPrice AvgPrice MaxPrice PriceVariability TotalOrders
4137 CARRIAGE 15.00 49.879433 150.00 135.00 141
1418 CABIN BAG VINTAGE RETROSPOT 12.75 22.484118 58.29 45.54 68
3126 SET/4 WHITE RETRO STORAGE CUBES 34.95 45.232258 76.55 41.60 155
1422 SEWING BOX RETROSPOT DESIGN 3.95 13.275465 34.04 30.09 86
1708 CHEST OF DRAWERS GINGHAM HEART 4.00 18.919890 34.00 30.00 91
2490 BOTANICAL GARDENS WALL CLOCK 20.80 24.791807 49.96 29.16 83
1760 BREAD BIN DINER STYLE RED 5.00 17.699460 34.00 29.00 278
1340 REGENCY CAKESTAND 3 TIER 4.00 13.983936 32.04 28.04 1988
1865 WOODEN ADVENT CALENDAR RED 7.95 13.112482 34.00 26.05 137
1864 WOODEN ADVENT CALENDAR CREAM 8.00 18.158261 34.00 26.00 137

Price Optimization Visualization

Show Code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor('#0d0d1a')
fig.suptitle('Figure: Price Optimisation Analysis',
             color='#ffffff', fontsize=14, fontweight='bold')

# ── Plot 1: Revenue by Price Band ──
ax1 = axes[0]
ax1.set_facecolor('#13131f')

bands  = band_summary['PriceBand'].tolist()
revenue = band_summary['TotalRevenue'].tolist()
band_colors = ['#4d9fff', '#00c896', '#f0a500',
               '#ff6b6b', '#7b61ff', '#ff9d00']

bars = ax1.bar(range(len(bands)), revenue,
               color=band_colors, edgecolor='none', width=0.6)

ax1.set_xticks(range(len(bands)))
ax1.set_xticklabels(bands, rotation=30, ha='right',
                    color='#c8c8d8', fontsize=8)
ax1.set_ylabel('Total Revenue (£)', color='#c8c8d8')
ax1.set_title('Total Revenue by Price Band',
              color='#ffffff', fontweight='bold', pad=10)
ax1.tick_params(colors='#c8c8d8')
ax1.spines[:].set_color('#2a2a3a')
ax1.grid(color='#2a2a3a', linestyle='--',
         linewidth=0.6, axis='y')
ax1.yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'£{x/1000:.0f}K')
)
for bar, val in zip(bars, revenue):
    ax1.text(bar.get_x() + bar.get_width()/2,
             bar.get_height() + 20000,
             f'£{val/1000:.0f}K',
             ha='center', color='#ffffff', fontsize=8)

# ── Plot 2: Price Variability — Top Inconsistent Products ──
ax2 = axes[1]
ax2.set_facecolor('#13131f')

price_var = price_analysis[
    price_analysis['TotalOrders'] > 50
].nlargest(10, 'PriceVariability').copy()

price_var['ShortName'] = price_var['Description'].str[:28]

y = range(len(price_var))

ax2.barh(y, price_var['MaxPrice'],
         color='#ff6b6b', alpha=0.5,
         edgecolor='none', height=0.5,
         label='Max Price')
ax2.barh(y, price_var['AvgPrice'],
         color='#00c896', alpha=0.9,
         edgecolor='none', height=0.5,
         label='Avg Price')
ax2.barh(y, price_var['MinPrice'],
         color='#4d9fff', alpha=0.9,
         edgecolor='none', height=0.5,
         label='Min Price')

ax2.set_yticks(y)
ax2.set_yticklabels(price_var['ShortName'],
                    color='#c8c8d8', fontsize=8)
ax2.set_xlabel('Price (£)', color='#c8c8d8')
ax2.set_title('Price Variability — Top 10 Inconsistent Products',
              color='#ffffff', fontweight='bold', pad=10)
ax2.tick_params(colors='#c8c8d8')
ax2.spines[:].set_color('#2a2a3a')
ax2.grid(color='#2a2a3a', linestyle='--',
         linewidth=0.6, axis='x')
ax2.invert_yaxis()
ax2.legend(facecolor='#13131f',
           labelcolor='#c8c8d8', fontsize=9)

plt.tight_layout()
plt.show()

Price Optimisation Findings and Business Recommendations

The £1 to £5 band generates £6.52M — 61% of total revenue. The Regency Cakestand has been sold at prices ranging from £4.00 to £32.04 across 1,988 orders.

Recommendations:

  • Standardise pricing on all Critical inventory products immediately.
  • A selective 10% price increase on the top 50 products in the £1 to £5 band could generate approximately £650,000 in additional annual revenue.
  • Introduce a formal pricing governance process with documented base prices and minimum wholesale prices.
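The £650,000 estimate is simple arithmetic under strong assumptions, sketched below: it treats the top 50 products as capturing essentially the whole band's revenue, and it assumes demand does not fall when prices rise 10%, an assumption worth testing with a small price experiment first.

```python
# Back-of-envelope check of the £650K estimate. Assumes the top 50 products
# capture essentially all band revenue and that demand is price-inelastic.
band_revenue = 6_524_208.41   # £1 to £5 band total, from the band summary
price_uplift = 0.10

extra_revenue = band_revenue * price_uplift
print(f"£{extra_revenue:,.0f}")   # ≈ £652,421
```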

Conclusions and Recommendations

This report delivered a comprehensive business intelligence analysis across six themes, transforming 524,878 transaction records into actionable revenue intelligence for NovaMart Global.

Summary of Findings

Theme Key Finding Business Impact
Customer Segmentation 717 Champions generate 65% of revenue Critical concentration risk
Churn Prediction 33.4% of customers have gone silent 1,449 customers need intervention
Cross-Sell 194 association rules identified Basket size increase opportunity
Demand Forecasting November peak at £1.5M Stock planning must reflect seasonality
Inventory Replenishment 7 critical products with 100% consistency Zero stockout tolerance required
Price Optimisation £1 to £5 band drives 61% of revenue Pricing inconsistencies eroding margin

The Three Most Urgent Actions

1. Protect the Champions. 717 customers generate £5.78M — 65% of total revenue. A VIP retention programme for this segment is not optional; it is a business continuity measure.

2. Fix the Pricing Inconsistencies. The Regency Cakestand has been sold at prices ranging from £4.00 to £32.04 across 1,988 orders. A formal pricing governance framework must be implemented immediately.

3. Deploy Cross-Sell Recommendations. 194 association rules are ready for deployment. This is the lowest-effort, highest-return recommendation in this report.

Data Limitations and Honest Constraints

  • Churn model performance (ROC-AUC 76.39%) reflects the honest limits of transaction data alone.
  • Demand forecasting was limited to a trend-only model due to only 13 months of available data.
  • Missing CustomerID on 24.93% of records limits customer-level analysis coverage.

Final Note to Richard Okafor

The data is clear. NovaMart is not a business with a revenue problem; it is a business with a revenue concentration problem. The tools to address it exist within the transaction history already collected. The recommendations in this report require no new data infrastructure, no new technology investment, and no external consultants. They require decisions and execution.

The analysis is complete. The next step is yours.


Report prepared by: Freda Erinmwingbovo
Prepared for: Richard Okafor, CEO, NovaMart Global
Date: February 2026
Tools: Python · pandas · scikit-learn · mlxtend · statsmodels · matplotlib · Quarto
Dataset: UCI Machine Learning Repository: Online Retail Dataset
541,909 original records · 524,878 clean records · 38 countries · December 2010 to December 2011

This report was produced to professional data science standards. All findings are reproducible from the code cells above. No results have been overstated or adjusted to appear more favourable than the data supports.